feat(AMD): support AMD GPUs (ROCm/HIP) #354
Open
RixinLiu wants to merge 8 commits into
Route all GPU virtual-memory calls through a new compile-time HIP/CUDA abstraction so the same code builds and runs on both NVIDIA (CUDA driver API) and AMD (HIP runtime).

- csrc/inc/gpu_vmm.hpp: new backend-neutral VMM wrappers, dispatched by KVCACHED_USE_HIP / KVCACHED_USE_CUDA; adds mem_get_info + device_synchronize.
- cuda_utils.hpp -> gpu_utils.hpp: check macros route to gpu_vmm::check; keeps the LOGGER stack.
- page/ftensor/allocator/page_allocator/torch_bindings: use gpu_vmm and lower torch:: types to c10::/at:: (drop the torch/extension.h umbrella).
- setup.py: detect torch.version.hip; build CppExtension(+amdhip64) for ROCm, CUDAExtension(+cuda) for NVIDIA.
- bench_vmm: build for either backend (make KVCACHED_BACKEND=hip).

Verified on AMD Instinct MI300X (ROCm 7.2): bench_vmm, full extension build, and a python smoke test (init -> create -> map -> GPU r/w -> unmap) all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
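The backend selection described for `setup.py` can be sketched roughly as below. This is an illustration, not the PR's actual build script: `select_backend` and the dictionary keys are hypothetical names, and the two arguments stand in for `torch.version.hip` and `torch.version.cuda` so the sketch runs without PyTorch. Passing the `KVCACHED_USE_*` macro from the build script is also an assumption.

```python
from typing import Optional


def select_backend(hip_version: Optional[str], cuda_version: Optional[str]) -> dict:
    """Pick build settings from the two version strings PyTorch exposes.

    Hypothetical helper: the arguments stand in for torch.version.hip and
    torch.version.cuda so the logic can be exercised without PyTorch.
    """
    if hip_version:
        # ROCm build of PyTorch: plain C++ extension linked against the HIP runtime.
        return {
            "backend": "hip",
            "extension": "CppExtension",
            "libraries": ["amdhip64"],
            "define": "KVCACHED_USE_HIP",  # assumed to be set by the build
        }
    if cuda_version:
        # NVIDIA build of PyTorch: CUDA extension linked against the driver library.
        return {
            "backend": "cuda",
            "extension": "CUDAExtension",
            "libraries": ["cuda"],
            "define": "KVCACHED_USE_CUDA",  # assumed to be set by the build
        }
    raise RuntimeError("PyTorch reports neither a HIP nor a CUDA build")


print(select_backend("7.0.51831-a3e329ad8", None)["libraries"])  # ['amdhip64']
print(select_backend(None, "12.8")["extension"])                 # CUDAExtension
```

Checking the HIP version first matters because it is the field that distinguishes a ROCm build; on NVIDIA builds `torch.version.hip` is `None`.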
Make the elastic pools/allocators accept both 'cuda' and 'hip' device strings via _is_supported_gpu_device (no functional change on ROCm, which reports 'cuda'). Also sync the bench_vmm README for the HIP backend and drop the orphaned bench cuda_utils.hpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's ROCm attention backend (split_kv_cache + paged kernels) cannot read the strided per-layer KV tensors that kvcached's contiguous (compound-page) layout produces; CUDA's FlashAttention/FlashInfer tolerate it. Auto-default CONTIGUOUS_LAYOUT=false when torch is a HIP build (explicit env still wins) so AMD is correct out of the box, plus a README note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
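A minimal sketch of the "auto-default on HIP, explicit env still wins" rule follows. The function name `contiguous_layout_default`, the injected `env` mapping, and the accepted truthy spellings are assumptions for illustration; only the variable name `KVCACHED_CONTIGUOUS_LAYOUT` and the precedence come from this PR.

```python
import os
from typing import Mapping, Optional


def contiguous_layout_default(hip_version: Optional[str],
                              env: Mapping[str, str] = os.environ) -> bool:
    """Decide whether the contiguous KV layout should be used.

    hip_version stands in for torch.version.hip (None on CUDA builds).
    """
    raw = env.get("KVCACHED_CONTIGUOUS_LAYOUT")
    if raw is not None:
        # An explicit setting always overrides the backend-based default.
        # The set of accepted spellings here is an assumption.
        return raw.strip().lower() in ("1", "true", "yes", "on")
    # No explicit setting: contiguous on CUDA builds, non-contiguous on HIP builds.
    return not hip_version


print(contiguous_layout_default(None, {}))   # True
print(contiguous_layout_default("7.0", {}))  # False
print(contiguous_layout_default("7.0", {"KVCACHED_CONTIGUOUS_LAYOUT": "true"}))  # True
```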
Drives the vLLM offline engine with kvcached and watches the mapped KV footprint via the /dev/shm IPC: it grows on load (mem_map) and shrinks on free (mem_unmap) with output unchanged. Complements the manager-level test_kvcache_manager.py. Validated on AMD MI300X (ROCm 7.0); runs on NVIDIA too (device "cuda:0"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
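The grow-then-shrink condition this test watches for could be expressed along these lines. This is only a sketch: `footprint_grew_then_shrank` is a hypothetical helper, and the samples are assumed to be mapped-byte counts read periodically from the IPC stats, which may not match how the actual test records them.

```python
from typing import Sequence


def footprint_grew_then_shrank(samples: Sequence[int]) -> bool:
    """Return True if mapped bytes rose above the first sample and later fell below the peak."""
    if len(samples) < 3:
        return False
    # Index of the first occurrence of the maximum footprint.
    peak_index = max(range(len(samples)), key=samples.__getitem__)
    peak = samples[peak_index]
    grew = peak > samples[0]                   # mem_map happened under load
    shrank = min(samples[peak_index:]) < peak  # mem_unmap happened after the peak
    return grew and shrank


print(footprint_grew_then_shrank([0, 4, 8, 8, 2]))  # True
print(footprint_grew_then_shrank([0, 4, 8]))        # False
```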
…0.5.12)

Bundles the SGLang-compat fixes (also submitted standalone as the sglang-0512-compat PR) so SGLang-on-AMD works from this branch alone:

- version_utils: fall back to importlib.metadata when sglang exposes no module-level __version__ (source builds), so the patches don't silently no-op.
- patches: generalize the scheduler_memory_leak patch across SGLang's leak-check layouts (old single method / new SchedulerRuntimeCheckerMixin), skipping the req_to_token_pool-specific check kvcached must not silence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
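The metadata fallback can be sketched as follows. `detect_version` is a hypothetical name, not the function in `version_utils`, and the sketch assumes the importable module name and the installed distribution name are the same, which need not hold for every package.

```python
import importlib
from importlib import metadata
from typing import Optional


def detect_version(package: str) -> Optional[str]:
    """Return a package's version, preferring the module attribute over metadata."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        module = None
    version = getattr(module, "__version__", None)
    if isinstance(version, str):
        return version
    try:
        # Fallback for builds that expose no module-level __version__.
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Returning `None` rather than raising leaves it to the caller to decide whether an undetectable version should skip the patches or fail loudly.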
samples needed an element type; seg_name=[None] inferred as list[None], which made seg_name[0] non-indexable under mypy. Verified against the CI mypy-3.10 hook (and ruff/isort/codespell) locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a compile-time GPU backend abstraction so kvcached can run on AMD GPUs via ROCm/HIP while preserving the existing CUDA path, and updates integrations/docs/tests accordingly.
Changes:
- Introduces a HIP/CUDA VMM dispatch layer (`gpu_vmm.hpp`) and rewires core C++ components to use it.
- Updates build tooling (`setup.py`, benchmark Makefile) to select/link the correct backend (CUDA vs ROCm/HIP).
- Extends integrations (vLLM/SGLang) and adds an end-to-end elasticity-under-load script/test plus documentation for ROCm's default KV layout.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_elastic_serving.py | Adds an end-to-end script/test to observe KV mapping growth/shrink under load via IPC stats. |
| setup.py | Detects CUDA vs ROCm from PyTorch and builds/links with the appropriate extension type and libraries. |
| README.md | Documents ROCm defaulting to non-contiguous KV layout and rationale. |
| kvcached/utils.py | Defaults KV layout to non-contiguous on ROCm unless explicitly overridden. |
| kvcached/integration/vllm/interfaces.py | Adjusts GPU availability assertion message (integration path touched). |
| kvcached/integration/version_utils.py | Adds metadata fallback for version detection when module attributes are missing. |
| kvcached/integration/sglang/patches.py | Broadens device-string acceptance and generalizes scheduler leak-check patching. |
| kvcached/integration/sglang/interfaces.py | Adjusts GPU availability assertion message (integration path touched). |
| csrc/torch_bindings.cpp | Switches to at::Tensor and updates Torch/pybind includes for bindings. |
| csrc/page.cpp | Ports GPU page allocation/mapping to the new backend-agnostic VMM layer. |
| csrc/page_allocator.cpp | Uses backend-agnostic mem-info and sync calls during unmap/availability computations. |
| csrc/inc/torch_utils.hpp | Changes dtype helpers to c10::ScalarType declarations and header includes. |
| csrc/inc/page.hpp | Replaces CUDA-specific types with backend-agnostic VMM handle/access helpers. |
| csrc/inc/page_allocator.hpp | Removes CUDA-specific includes and adjusts headers for new usage. |
| csrc/inc/mem_info_tracker.hpp | Updates to use generalized GPU utilities header. |
| csrc/inc/impl/torch_utils.ipp | Converts dtype mapping helpers to c10::ScalarType. |
| csrc/inc/gpu_vmm.hpp | New HIP/CUDA abstraction for VMM operations and error handling. |
| csrc/inc/gpu_utils.hpp | Generalizes logging/check macros to route through the new VMM abstraction. |
| csrc/inc/ftensor.hpp | Replaces torch::* types with at::Tensor / c10::* types. |
| csrc/inc/allocator.hpp | Replaces torch::* types with at::Tensor / c10::* types and renames init helper. |
| csrc/ftensor.cpp | Ports virtual address reservation/mapping/unmapping to backend-agnostic VMM calls. |
| csrc/allocator.cpp | Ports allocator init and tensor creation paths to backend-agnostic VMM calls/types. |
| benchmarks/bench_vmm/README.md | Updates benchmark docs to reflect CUDA+HIP support and renamed operations. |
| benchmarks/bench_vmm/Makefile | Adds backend selection (cuda vs hip) and links the correct driver library. |
| benchmarks/bench_vmm/cuda_utils.hpp | Removes CUDA-only utilities now superseded by shared GPU utilities. |
| benchmarks/bench_vmm/bench_vmm.cpp | Ports benchmark implementation to the backend-agnostic VMM utilities. |
Comments suppressed due to low confidence (2)
kvcached/integration/vllm/interfaces.py:200
`device` is now allowed to be a HIP-style string (e.g. "hip:0"), but this function still passes it directly to `torch.cuda.get_device_properties(...)` and to `create_kv_tensors(...)`. PyTorch does not recognize a `hip` device type (ROCm GPUs are exposed as `cuda`), so this will raise and/or break the C++ extension device parsing when `device` starts with "hip".
assert torch.cuda.is_available(), "GPU backend is not available via torch.cuda."
# --- Compute per-layer memory budget and number of blocks ---
gpu_mem_bytes = torch.cuda.get_device_properties(device).total_memory
gpu_mem_bytes_per_layer_k_or_v = gpu_mem_bytes // num_layers // num_k_or_v
kvcached/integration/sglang/interfaces.py:100
This function now accepts HIP-style device strings ("hip:0"), but still passes `device` directly to `torch.cuda.get_device_properties(...)` and `create_kv_tensors(...)`. PyTorch ROCm builds expose HIP devices as `cuda`, so a `hip:*` device string will not be understood by either PyTorch or the C++ extension's `c10::Device` parsing.
assert torch.cuda.is_available(), "GPU backend is not available via torch.cuda."
# SGLang named it "page" to be consistent with PagedAttention. But we call
# it "block" to distinguish a KV cache block and a physical memory page.
block_size = page_size
- page_allocator: check mem_get_info status via CHECK_GPU instead of discarding it with (void); a failed call previously computed a page count from uninitialized sizes. Zero-init the sizes too.
- torch_utils.hpp: include <pybind11/pybind11.h> so the header is self-contained -- it declares functions taking py::object but relied on the includer pulling in pybind11 first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The integration accepts `hip` device strings, but PyTorch-ROCm and the C++ extension (c10::Device) address AMD GPUs as `cuda`, so a literal `hip:0` would fail in torch.cuda.get_device_properties / create_kv_tensors / the C++ init. Add a shared normalize_gpu_device() helper and apply it at every device entry point in both the vLLM and SGLang integrations (init_kvcached + alloc paths). No-op for the `cuda` strings the engines actually pass on ROCm; verified with a vLLM generate smoke on MI300X. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
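A normalization helper of the kind described here might look like the sketch below. The name `normalize_gpu_device` comes from the commit message, but this body is an assumption about its behavior, not the PR's implementation.

```python
def normalize_gpu_device(device) -> str:
    """Rewrite a `hip` device string to the `cuda` spelling PyTorch-ROCm expects.

    Assumed behavior: only the device type is rewritten; the index is kept.
    """
    text = str(device).strip()
    kind, sep, index = text.partition(":")
    if kind.lower() == "hip":
        return "cuda" + sep + index
    # Anything else (including the `cuda` strings engines pass on ROCm) is untouched.
    return text


print(normalize_gpu_device("hip:0"))   # cuda:0
print(normalize_gpu_device("cuda:1"))  # cuda:1
```

Applying it once at each entry point means downstream calls such as `torch.cuda.get_device_properties(...)` only ever see `cuda` strings.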
cui36
approved these changes
Jun 4, 2026
Collaborator
Reproduced the experiments on AMD. LGTM.
Summary
Ports kvcached to AMD GPUs (ROCm/HIP). The KV cache, its C++ page allocator, and the GPU virtual-memory operations that grow/shrink it now build and run on both NVIDIA (CUDA driver API) and AMD (HIP), selected at compile time. Validated end-to-end on an AMD Instinct MI300X (ROCm 7.0): builds, serves dense models with output identical to the no-kvcached baseline, and the elastic KV cache physically grows and shrinks under load. NVIDIA is unaffected (CUDA non-regression checked).
Provenance
The AMD support was originally developed on the `amd-support-init` branch. Merging that branch directly into `main` was not reasonable (it had diverged), so its changes were re-applied as fresh, self-contained commits on this branch (`amd-support-port`, branched off the current `main`). Functionally this is that work, reconstructed cleanly on top of `main`.

What changed
- `csrc/inc/gpu_vmm.hpp` (new, ~255 lines): a compile-time HIP/CUDA dispatch layer. Everything calls a `vmm::` namespace (`address_reserve`, `mem_create`, `mem_map`, `set_access`, `mem_unmap`, …) that resolves to `hipMem*` or `cuMem*` based on `KVCACHED_USE_HIP` / `KVCACHED_USE_CUDA`.
- `cuda_utils.hpp` → `gpu_utils.hpp`: renamed/generalized (logging + device helpers), CUDA-specific bits moved behind the abstraction.
- `csrc/` rewired: `allocator`, `ftensor`, `page`, `page_allocator`, `torch_bindings`, `mem_info_tracker` now go through `gpu_vmm.hpp` and `at::`/`c10::` types instead of CUDA-only calls.
- `setup.py`: auto-detects the backend from `torch.version.hip` vs `torch.version.cuda`; builds a `CppExtension` linking `amdhip64` on ROCm, `CUDAExtension` linking `cuda` on NVIDIA.
- vLLM/SGLang integrations: accept `hip` device strings (PyTorch-ROCm masquerades GPUs as `cuda`, but the asserts are now backend-agnostic: `cuda`/`hip`).
- SGLang compatibility: `version_utils` falls back to package metadata when `sglang.__version__` is absent (so the patches don't silently no-op), and the `scheduler_memory_leak` patch generalizes across SGLang's old/new leak-check layouts while leaving the `req_to_token_pool` check intact. (Also submitted standalone as the `sglang-0512-compat` PR.)
- KV layout: defaults to `KVCACHED_CONTIGUOUS_LAYOUT=false` on HIP (see note below); README updated.
- `benchmarks/bench_vmm`: now cross-platform via the same abstraction (`make` for CUDA, `make KVCACHED_BACKEND=hip` for AMD).
- `tests/test_elastic_serving.py` (new): e2e elasticity check (grow/shrink under load), complements the manager-level `test_kvcache_manager.py`.

Why non-contiguous layout is the ROCm default
kvcached supports two KV layouts. The historical default, contiguous, packs all layers into one interleaved tensor, so a per-layer view is strided (`is_contiguous=False`). vLLM's ROCm paged-attention backend slices the KV cache with `.view()` + paged kernels that assume a standard contiguous per-layer stride, so the strided views produce wrong output on AMD. The non-contiguous layout gives each layer its own standard contiguous tensor, which the ROCm kernels handle correctly. (On NVIDIA, FlashAttention/FlashInfer use stride-tolerant `unbind`/`varlen` paths, so contiguous works there.) We therefore auto-select non-contiguous on HIP; it remains overridable via `KVCACHED_CONTIGUOUS_LAYOUT`.

Validation (AMD MI300X, ROCm 7.0)
- `setup.py` HIP path compiles the whole `csrc/`
- `bench_vmm` (HIP) runs; `mem_map` p50 ~4 µs, `mem_unmap` ~82 µs
- `test_elastic_serving.py`: mapped KV grows (`hipMemMap`) under load and shrinks (`hipMemUnmap`) on free; output unchanged

Environment
AMD, primary validation (MI300X):
- 0x74b5), 192 GiB HBM3
- `rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210`: Python 3.12.12, PyTorch 2.9.0a0+git1c57644 (torch.version.hip 7.0.51831-a3e329ad8), vLLM 0.11.2.dev673+g839868462
- `lmsysorg/sglang:v0.5.12.post1-rocm700-mi30x`: Python 3.10.12, PyTorch 2.9.0a0+git7bcbafe (torch.version.hip 7.0.51831-a3e329ad8), SGLang 0.5.12.post1

NVIDIA, CUDA non-regression (Spark):
- /usr/local/cuda)
- torch.version.hip = None)

Known limitations
arange's range must be a power of 2) that reproduces on the unmodified engine too: an engine/Triton-on-ROCm issue, not a kvcached regression.

Scope & next steps
This PR establishes portability and correctness on AMD (see Validation); it deliberately does not include performance benchmarking. AMD benchmarking is the next step and is tracked separately on the `amd-benchmark` branch (serving overhead vs vanilla, multi-instance elastic sharing, and VMM-op latency), kept out of this PR to keep it focused on the port itself.

Note
The SGLang-compat fixes (version detection + leak-check) are committed in this branch, so it serves SGLang on AMD standalone, with no external dependency. The same fixes are also submitted as the focused `sglang-0512-compat` PR #353; the two carry identical changes to those files, so whichever merges second simply re-applies the same lines.

Commits
- feat: support AMD GPUs (ROCm/HIP) via gpu_vmm abstraction
- feat(amd): accept hip device strings in sglang/vllm integration
- fix(amd): default to non-contiguous KV layout on ROCm
- test(amd): add e2e KV-cache elasticity-under-load test
- fix(sglang): version detection + refactored leak check (SGLang 0.5.11+)